Introduction to Bayesian Data Analysis

Chenzi Xu
MPhil DPhil (Oxon)

University of York

2022-12-04 (updated: 2022-12-12)

Outline

  1. Why Bayesian data analysis?
  2. Fundamentals
    • Random variable and distribution
    • Conditional probability
    • Marginal likelihood
  3. Bayes’ rule
  4. Bayesian analysis
  5. Computational Bayesian models in R

Why Bayesian?


Frequentist Framework


The Central Limit Theorem

The sampling distribution of means is approximately normal, provided that:

  1. the sample size is large enough
  2. \(\mu\) and \(\sigma\) are defined for the probability density or mass function that generated the data

Frequentist Framework



Hypothetical repeated sampling

The Exponential distribution has a rate parameter \(\lambda\).

Mean: \(1/\lambda\);

Variance: \(1/\lambda^2\)

n <- 1000 # number of samples per experiment
k <- 2000 # number of experiments

# generate data: each column is one experiment of n draws
y_matrix <- matrix(NA, nrow = n, ncol = k)

for (i in 1:k) {
  y_matrix[, i] <- rexp(n, rate = 1/10)
}
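The Central Limit Theorem in action (a continuation sketch, using the y_matrix simulated above): the 2,000 sample means are approximately normally distributed around the true mean of 10, even though the raw data are exponential.

means <- colMeans(y_matrix)  # one sample mean per experiment

# sampling distribution of the mean: approximately Normal(10, 10/sqrt(n))
hist(means, freq = FALSE,
     main = "Sampling distribution of the mean")
curve(dnorm(x, mean = 10, sd = 10 / sqrt(n)), add = TRUE)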

Frequentist Framework



Confidence Interval (CI)

It represents the range of plausible values of the \(\mu\) parameter.

If you take samples repeatedly and compute the CI each time, 95% of those CIs will contain the true population mean \(\mu\).
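A simulation sketch of this interpretation (hypothetical normal data, not from the slides): compute a 95% CI from each of many samples and check how often the interval covers the true mean.

mu <- 10; sigma <- 10  # true parameters (assumed for illustration)
n <- 100; k <- 2000    # sample size and number of experiments

covers <- replicate(k, {
  y <- rnorm(n, mu, sigma)
  ci <- mean(y) + c(-1, 1) * qt(0.975, df = n - 1) * sd(y) / sqrt(n)
  ci[1] < mu & mu < ci[2]
})
mean(covers)  # proportion of CIs containing mu, close to 0.95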

Frequentist Framework



\(p\)-value

What it is not: the probability that the null hypothesis is true, or the probability that the alternative hypothesis is false.

What it is: the probability of obtaining the observed sample statistic, or some value more extreme, conditional on the assumption that the null hypothesis is true.

Frequentist Framework



Errors and Power

  • Type I, II, M, S errors
  • Power (1-Type II error) is a function of:
    • the effect size
    • the standard deviation
    • the sample size.

Null hypothesis significance testing (NHST) is only meaningful when power is high.

Why Bayesian?



  • Friendly to limited data
  • Intuitive and solid model
  • Straightforward interpretation
  • Model flexibility
  • Not limited by optimisation constraints

Fundamentals


Random variable and distribution



Data: we model the data as being generated by an underlying random variable \(Y\).

Discrete random variable

Examples:

  • Acceptability ratings on a Likert scale
  • Binary grammaticality judgements

Probability distribution \(p(Y)\):

  • Probability Mass Function (PMF)
  • The PMF maps each possible outcome of \(Y\) to a probability in \([0, 1]\).

Continuous random variable

Examples:

  • Reading times or reaction times (in ms)
  • EEG signals (in microvolts)
  • f0, F1, F2…

Probability distribution \(p(Y)\):

  • Probability Density Function (PDF)
  • The PDF assigns probabilities to ranges of possible outcomes of \(Y\): the probability of a range is the area under the PDF over that range, a value in \([0, 1]\).

Discrete random variable: Binomial distribution



PMF for binomial distribution:

\[\begin{equation} \hbox{f}(k\mid n,\theta) = \binom{n}{k} \theta^{k} (1-\theta)^{n-k} \end{equation}\]

Here, \(n\) represents the total number of trials, \(k\) the number of successes, and \(\theta\) the probability of success. The term \(\binom{n}{k}\), the number of ways in which one can choose \(k\) successes out of \(n\) trials, expands to \(\frac{n!}{k!(n-k)!}\).
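For instance, the probability of \(k = 7\) successes in \(n = 10\) trials with \(\theta = 0.5\) can be computed from the formula or with R's dbinom() (a small check; values assumed for illustration):

n <- 10; k <- 7; theta <- 0.5
choose(n, k) * theta^k * (1 - theta)^(n - k)  # 0.1171875
dbinom(k, size = n, prob = theta)             # same value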

Continuous random variable: Normal distribution



PDF for normal distribution \(X \stackrel{iid}{\sim} Normal(\mu,\sigma)\):

\[\begin{equation} f(x\mid\mu,\sigma)= \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left(-\frac{(x-\mu)^2}{2\sigma^2} \right) \end{equation}\]

Assumption: Each data point in the vector of data y is independent of the others.

The kernel of the normal PDF is \(g(x\mid\mu,\sigma)=\exp \left(-\frac{(x-\mu)^2}{2\sigma^2} \right)\); the normalising constant \(k\) ensures that \(k\int{g(x)dx}=1\).

Probability in a continuous distribution is the area under the curve \(P(X<u)=\int_{-\infty}^uf(x)dx\), and will always be zero at any point value \(P(X=x)=0\).
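A small illustration with the standard normal (a sketch; \(\mu = 0\), \(\sigma = 1\) assumed): the probability of an interval is the area under the PDF, and any single point has probability zero.

pnorm(1, mean = 0, sd = 1)                       # P(X < 1), about 0.841
integrate(dnorm, lower = -Inf, upper = 1)$value  # same, by numerical integration
# P(X = 1) = 0 for a continuous random variable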

Continuous random variable: Normal distribution



Expectation and Variance



Expectation \(E[X]\): the weighted mean of the possible outcomes, weighted by the respective probabilities of each outcome.

Variance is defined as: \(Var(X) = E[X^2]-E[X]^2\)

Discrete: Binomial

Expectation: \(E\left[Y\right]=\sum_y y \cdot f(y)=n\theta\)

Variance: \(Var(Y)=n\theta(1-\theta)\)

  • Estimated \(\hat{\theta}=\dfrac{k}{n}\) (maximum likelihood estimate of the true parameter \(\theta\))

Continuous: Normal

Expectation: \(E[X]=\int xf(x)dx=\mu\)

Variance: \(Var(X)=\sigma^2\)

  • Estimated \(\hat{\mu}=\bar{y}=\dfrac{\sum y}{n}\)
  • Estimated \(\hat{\sigma}^2=\dfrac{\sum (y-\bar{y})^2}{n-1}\) (the unbiased estimate; the MLE divides by \(n\) instead of \(n-1\))
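A quick simulation sketch (hypothetical data) showing these estimators at work:

# binomial: estimate theta from k successes in n trials
n <- 100
k <- rbinom(1, size = n, prob = 0.7)
theta_hat <- k / n

# normal: estimate mu and sigma^2 from a sample
y <- rnorm(1000, mean = 168, sd = 25)
mu_hat     <- mean(y)   # sum(y) / n
sigma2_hat <- var(y)    # sum((y - mean(y))^2) / (n - 1)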

Likelihood Function



The likelihood function \(\mathcal{L}(\theta \mid k,n)\) refers to the PMF \(p(k|n,\theta)\), treated as a function of \(\theta\).

Suppose that we record \(n = 10\) trials, and observe \(k = 7\) successes (heads in coin tosses).

\[\begin{equation} \mathcal{L}(\theta \mid k=7,n=10)= \binom{10}{7} \theta^{7} (1-\theta)^{10-7} \end{equation}\]
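The likelihood can be plotted as a function of \(\theta\) (a sketch using dbinom()); it peaks at the maximum likelihood estimate \(\hat{\theta} = 7/10\).

theta <- seq(0, 1, by = 0.01)
lik <- dbinom(7, size = 10, prob = theta)  # likelihood of each theta given k = 7, n = 10
plot(theta, lik, type = "l",
     xlab = expression(theta), ylab = "Likelihood")
abline(v = 7 / 10, lty = 2)                # maximum likelihood estimate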

Marginal likelihood



The concept of “integrating out a parameter”:

\[\begin{equation} p(k=7 \mid n=10)=\int_0^1 \binom{10}{7} \theta^{7} (1-\theta)^{10-7} d\theta \end{equation}\]

Marginal likelihood: the likelihood computed by “marginalizing” out the parameter \(\theta\). It is a weighted average of the likelihood over the possible values of the parameter, weighted by their prior probability.
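A numerical sketch of this integral with integrate() (the equation above implicitly weights all values of \(\theta\) equally):

# likelihood as a function of theta, for k = 7 successes in n = 10 trials
lik <- function(theta) dbinom(7, size = 10, prob = theta)

# marginal likelihood: integrate the likelihood over theta in [0, 1]
integrate(lik, lower = 0, upper = 1)$value  # 1/11, about 0.09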

Bivariate distributions



Discrete case

We have a joint PMF \(p_{X,Y} (x, y)\) for each possible pair of values of X and Y.

The marginal distributions: \[\begin{equation} p_{X}(x)=\sum_{y\in S_{Y}}p_{X,Y}(x,y) \end{equation}\]

\[\begin{equation} p_{Y}(y)=\sum_{x\in S_{X}}p_{X,Y}(x,y) \end{equation}\]

The conditional distributions: \[\begin{equation} p_{X\mid Y}(x\mid y) = \frac{p_{X,Y}(x,y)}{p_Y(y)} \end{equation}\]

\[\begin{equation} p_{Y\mid X}(y\mid x) = \frac{p_{X,Y}(x,y)}{p_X(x)} \end{equation}\]
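A toy example (hypothetical joint PMF) computing marginal and conditional distributions in R:

# hypothetical joint PMF of X (rows) and Y (columns)
p_xy <- matrix(c(0.10, 0.20,
                 0.30, 0.40),
               nrow = 2, byrow = TRUE)

p_x <- rowSums(p_xy)                 # marginal distribution of X
p_y <- colSums(p_xy)                 # marginal distribution of Y
p_x_given_y1 <- p_xy[, 1] / p_y[1]   # conditional P(X | Y = y1)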

Bivariate distributions



Continuous case

PDF and CDF (\(\rho=0\)):

The joint distribution of X and Y: \[\begin{equation} \begin{pmatrix} X \\ Y \end{pmatrix}\sim \mathcal N_{2}\left(\begin{pmatrix} \mu_{X} \\ \mu_{Y} \end{pmatrix}, \begin{pmatrix} \sigma _{X}^2 & \rho_{XY}\sigma _{X}\sigma _{Y}\\ \rho_{XY}\sigma _{X}\sigma _{Y} & \sigma _{Y}^2\\ \end{pmatrix} \right) \end{equation}\]

In the variance-covariance matrix, \(\rho_{XY}\) is the correlation between X and Y; their covariance is \(Cov(X,Y) = \rho_{XY}\sigma _{X}\sigma _{Y}\).

The volume under the joint density integrates to 1. \[\begin{equation} \iint_{S_{X,Y}} f_{X,Y}(x,y)\, dx dy = 1 \end{equation}\]

The marginal distributions:

\[\begin{equation} f_X(x) = \int_{S_Y} f_{X,Y}(x,y)\, dy \quad f_Y(y) = \int_{S_X} f_{X,Y}(x,y)\, dx \end{equation}\]
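A sampling sketch with MASS::mvrnorm() (hypothetical parameter values): draws from the joint distribution have the specified correlation.

library(MASS)
mu    <- c(0, 0)
rho   <- 0.5
Sigma <- matrix(c(1,   rho,
                  rho, 1), nrow = 2)  # variance-covariance matrix
xy <- mvrnorm(n = 1000, mu = mu, Sigma = Sigma)
cor(xy[, 1], xy[, 2])                 # close to rho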

Conditional probability



The probability of A given B: \[\begin{equation} P(A|B)=\dfrac{P(A,B)}{P(B)}=\frac{P(B|A)P(A)}{P(B)} \hbox{ where } P(B)>0 \end{equation}\]

Since the probability of A and B happening is the same as the probability of B and A happening, \(P(B,A)=P(A,B)\), we have:

\[\begin{equation} P(B,A)=P(B|A)P(A)=P(A|B)P(B)=P(A,B). \end{equation}\]

Rearranging terms:

\[\begin{equation} P(B|A)=\frac{P(A|B)P(B)}{P(A)} \end{equation}\]

This is Bayes’ rule.
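A numerical sketch (values chosen arbitrarily for illustration): with \(P(B) = 0.1\), \(P(A \mid B) = 0.9\), and \(P(A \mid \neg B) = 0.2\), Bayes' rule gives \(P(B \mid A) = 1/3\).

p_B      <- 0.1   # prior P(B)
p_A_B    <- 0.9   # P(A | B)
p_A_notB <- 0.2   # P(A | not B)

p_A   <- p_A_B * p_B + p_A_notB * (1 - p_B)  # marginal P(A)
p_B_A <- p_A_B * p_B / p_A                   # posterior P(B | A) = 1/3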

Exercise

You are an art dealer, and have been invited to purchase what may be a lost Leonardo da Vinci masterpiece. The seller has allowed you to view the painting under two conditions: you will have five minutes to examine it and you must decide on the spot whether to submit an offer or not. The price is $100,000,000. If the painting is genuine, it would be worth $300,000,000. If it is fake, it would be worth nothing. Before examining it, you assess that there is a 20% chance that it is genuine. When you examine it, you administer a test that always yields a positive result if the painting is genuine and yields a positive result 25% of the time if it is fake. The test yields a positive result.

  1. What probability would you assign to the painting being genuine?
  2. Would you buy the painting?
  3. Suppose that as you enter the building, you see a rival art dealer leaving. Would this affect your decision?
  4. Suppose you know your rival uses a different type of test. As with your test, it always yields a positive result if the painting is genuine and 25% of the time if it is fake. Would you buy the painting?
  5. Now suppose that your rival’s test yields a positive result 80% of the time if the painting is genuine and 20% of the time if it is fake. How would you assess the probability that the painting is genuine? In this case, would you buy the painting?

Bayes’ rule


Bayes’ rule



If \(y\) is a vector of data points, we have:

\[\begin{equation} f(\theta\mid y) = \frac{f(y\mid \theta) f(\theta)}{f(y)} \end{equation}\]

Here \(f(\cdot)\) is a PDF (continuous case) or a PMF (discrete case).

We can rewrite it as follows:

\[\begin{equation} Posterior = \frac{Likelihood \cdot Prior}{Marginal Likelihood} \end{equation}\]

Prior \(f(\theta)\): the initial probability distribution of the parameter(s), before seeing the data.

Posterior \(f(\theta \mid y)\): the probability distribution of the parameters conditional on the data.

Likelihood \(f(y \mid \theta)\): the PMF (discrete case) or the PDF (continuous case) expressed as a function of parameters \(\theta\).

Marginal Likelihood \(f(y)\): it normalises the posterior distribution so that the area under its curve integrates to 1.

Bayes’ rule



If \(y\) is a vector of data points, we have:

\[\begin{equation} f(\theta\mid y) = \frac{f(y\mid \theta) f(\theta)}{f(y)} \end{equation}\]

Here \(f(\cdot)\) is a PDF (continuous case) or a PMF (discrete case).

We can rewrite it as follows:

\[\begin{equation} Posterior = \frac{Likelihood \cdot Prior}{Marginal Likelihood} \end{equation}\]

Marginal Likelihood \(f(y)\): it functions as the “normalising constant”.

\[\begin{equation} f(y)=\int f(y,\theta) d\theta = \int f(y \mid \theta)f(\theta)d\theta \end{equation}\]

Without it, we have: \[\begin{equation} f(\theta\mid y) \propto f(y\mid \theta) f(\theta) \end{equation}\]

\[\begin{equation} \hbox{Posterior} \propto \hbox{Likelihood}\times \hbox{Prior} \end{equation}\]
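A grid-approximation sketch of this proportionality (beta-binomial example assumed, as in the later slides): multiply likelihood and prior point-wise, then normalise so the values sum to one.

theta <- seq(0.001, 0.999, length.out = 1000)
lik   <- dbinom(7, size = 10, prob = theta)    # likelihood
prior <- dbeta(theta, shape1 = 2, shape2 = 2)  # prior (hypothetical choice)

unnorm_post <- lik * prior
post <- unnorm_post / sum(unnorm_post)         # normalise over the grid
plot(theta, post, type = "l")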

NHST vs Bayes



Frequentist

\(P(Data \mid Theory)\)

  • No prior knowledge
  • Quantifies long-run probability of finding a false positive
  • Hard cut-off decisions

Bayesian

\(P(Theory \mid Data)\)

  • Incorporates prior knowledge
  • Quantifies uncertainty around possible parameter values
  • Gradual assessment of evidence

Bayesian analysis


Deriving the Posterior: Analytical



Binomial likelihood and Beta prior

1. Choosing a likelihood

The responses follow a binomial distribution:

\[\begin{equation} f(x\mid n,\theta) = {n \choose x}\theta^{x} (1-\theta)^{n-x} \end{equation}\]

Deriving the Posterior: Analytical



Binomial likelihood and Beta prior

2. Choosing a prior distribution

We need a distribution that can represent our uncertainty about the probability \(\theta\) of success.

The beta distribution is commonly used as prior for proportions.

\[\begin{equation*} f(\theta)= \left\{ \begin{array}{l l} \frac{1}{B(a,b)} \theta^{a - 1} (1-\theta)^{b-1} & \textrm{if } 0< \theta < 1\\ 0 & \textrm{otherwise}\\ \end{array} \right. \end{equation*}\]

Where \(B(a,b)=\int_0^1 \theta^{a-1}(1-\theta)^{b-1}\, d\theta\), a normalizing constant that ensures that the area under the curve sums to 1.

The expectation and variance of Beta PDF:

\[\begin{equation} E[\theta]=\frac{a}{a+b} \end{equation}\] \[\begin{equation} Var(\theta)=\frac{ab}{\left(a+b\right)^{2}\left(a+b+1\right)} \end{equation}\]
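A sketch of how different (a, b) values encode different prior beliefs about \(\theta\), using dbeta():

theta <- seq(0, 1, by = 0.01)
plot(theta, dbeta(theta, 1, 1), type = "l", ylim = c(0, 4),
     ylab = "density")                      # Beta(1, 1): flat
lines(theta, dbeta(theta, 4, 4), lty = 2)   # Beta(4, 4): centred on 0.5
lines(theta, dbeta(theta, 2, 8), lty = 3)   # Beta(2, 8): favours small theta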

Deriving the Posterior: Analytical



Binomial likelihood and Beta prior

3. Conjugate analysis based on Bayes’ rule

To get the posterior distribution for \(\theta\), we have (ignoring the normalising constant \({n \choose x}\)):

\[\begin{equation} f(\theta\mid x) \propto f(x\mid n, \theta) f(\theta) \end{equation}\]

We multiply the likelihood and the prior: \[\begin{equation} \theta^{x} (1-\theta)^{n-x}\theta^{a-1}(1-\theta)^{b-1}=\theta^{a+x-1}(1-\theta)^{b+n-x-1} \end{equation}\]

We computed the posterior up to proportionality.

Thus, given a binomial likelihood with \(x\) successes in \(n\) trials, and a \(Beta(a,b)\) prior on \(\theta\), the posterior will be \(Beta(a+x,\ b+n-x)\).
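A sketch of the conjugate update in R: with a hypothetical Beta(4, 4) prior and \(x = 7\) successes out of \(n = 10\) trials, the posterior is Beta(11, 7).

a <- 4; b <- 4   # prior Beta(a, b)
x <- 7; n <- 10  # observed successes and trials

a_post <- a + x        # posterior is Beta(a + x, b + n - x)
b_post <- b + n - x

a_post / (a_post + b_post)              # posterior mean: 11/18, about 0.61
qbeta(c(0.025, 0.975), a_post, b_post)  # 95% credible interval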

Deriving the Posterior: Analytical



The conjugate case of the beta-binomial:

  • There are other similar conjugate cases such as the gamma-poisson case.
  • The posterior mean is a compromise between the prior mean and the sample mean.
  • Given sparse data, informative priors will play an important role.
    • Expert knowledge or gut feelings
    • Theoretical predictions
    • Existing data
    • Meta-analysis

In realistic data-analysis settings, the likelihood function will be very complex, and many parameters will be involved.

Computational Bayesian models in R


Deriving the Posterior: Sampling



Sampling algorithms

MCMC (Markov Chain Monte Carlo) sampling

  • Gibbs sampling
  • Random walk Metropolis
  • Hamiltonian Monte Carlo (HMC, by Stan)

Bayes’ rule: \[\begin{equation} \begin{aligned} p(\Theta|Y) &= \cfrac{ p(Y|\Theta) \cdot p(\Theta) }{p(Y)}\\ p(\Theta|Y) &= \cfrac{ p(Y|\Theta) \cdot p(\Theta) }{\int_{\Theta} p(Y|\Theta) \cdot p(\Theta) d\Theta } \end{aligned} \end{equation}\]

MCMC Sampling



The inversion method

It works when we know the closed form of the CDF \(F\) of the distribution we want to simulate from and can derive its inverse \(F^{-1}\).

  1. Sample one number \(u\) from \(Unif(0,1)\). Let \(u=F(z)=\int_L^zf(x)dx\).
  2. \(z=F^{-1}(u)\) is a draw from \(f(x)\).
nsim <- 10000
samples <- rep(NA, nsim)
for (i in 1:nsim) {
  u <- runif(1, min = 0, max = 1)  # step 1: draw u ~ Unif(0, 1)
  samples[i] <- qnorm(u)           # step 2: invert the standard normal CDF
}
hist(samples, freq = FALSE,
     main = "Standard Normal")

MCMC Sampling



HMC demonstration

  • High-dimensional (hierarchical) models
  • Only works with continuous parameters
  • Based on Hamiltonian dynamics \(Energy(\theta,k)=U(\theta)+KE(k)\)

In the trace plot, we see a characteristic “fat hairy caterpillar” shape, which indicates that the sampler has converged.

The first few samples (warm-up) are dropped, because the chain may not yet be sampling from the posterior.

We usually run four parallel chains with different, randomly chosen initial values. If they all end up sampling from the same distribution and overlap visually, we say the chains are mixing well.

Bayesian Analysis using Stan: brms



Data:

Suppose we have data from a single subject repeatedly pressing the space bar as fast as possible, without paying attention to any stimuli. The data are response times in milliseconds in each trial.

RQ: How long does it take this subject to press a key?

library(bcogsci)
data("df_spacebar")
head(df_spacebar,n=2)
# A tibble: 2 × 2
     rt trial
  <int> <int>
1   141     1
2   138     2

Bayesian Analysis using Stan: brms



Assumptions: \[\begin{equation} rt_n \sim \mathit{Normal}(\mu, \sigma) \end{equation}\]

\[\begin{equation} rt_n = \mu + \varepsilon_n \hbox{, where } \varepsilon_n \stackrel{iid}{\sim} \mathit{Normal}(0,\sigma) \end{equation}\]

A frequentist linear model:

m <- lm(rt ~ 1, df_spacebar)
coef(m)
(Intercept) 
   168.6399 
sigma(m)
[1] 24.9118
mean(df_spacebar$rt)
[1] 168.6399
sd(df_spacebar$rt)
[1] 24.9118

Bayesian Analysis using Stan: brms



library(brms)
fit_press <- brm(rt ~ 1,
  data = df_spacebar,
  family = gaussian(),
  prior = c(
    prior(uniform(0, 60000), class = Intercept, lb = 0, 
          ub = 60000),
    prior(uniform(0, 2000), class = sigma, lb = 0, 
          ub = 2000)
  ),
  chains = 4,
  iter = 2000,
  warmup = 1000
)

SAMPLING FOR MODEL 'f7a7fc24ef4802fd1ddd47ff5ebae03c' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.000123 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.23 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.047495 seconds (Warm-up)
Chain 1:                0.047387 seconds (Sampling)
Chain 1:                0.094882 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'f7a7fc24ef4802fd1ddd47ff5ebae03c' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 1.4e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.14 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.057239 seconds (Warm-up)
Chain 2:                0.046728 seconds (Sampling)
Chain 2:                0.103967 seconds (Total)
Chain 2: 

SAMPLING FOR MODEL 'f7a7fc24ef4802fd1ddd47ff5ebae03c' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 2.2e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.22 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 3: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 3: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 3: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 3: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 3: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 3: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 0.045737 seconds (Warm-up)
Chain 3:                0.046846 seconds (Sampling)
Chain 3:                0.092583 seconds (Total)
Chain 3: 

SAMPLING FOR MODEL 'f7a7fc24ef4802fd1ddd47ff5ebae03c' NOW (CHAIN 4).
Chain 4: 
Chain 4: Gradient evaluation took 2.2e-05 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.22 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4: 
Chain 4: 
Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 0.051913 seconds (Warm-up)
Chain 4:                0.047835 seconds (Sampling)
Chain 4:                0.099748 seconds (Total)
Chain 4: 

Sampling Specifications:

chains: the number of independent runs for sampling (default: 4)

iter: the number of iterations that the sampler makes to sample from the posterior distribution of each parameter (default: 2000)

warmup: the number of iterations from the start of sampling that are eventually discarded (default: half of iter)

Bayesian Analysis using Stan: brms



plot(fit_press)

Bayesian Analysis using Stan: brms



as_draws_df(fit_press) %>% head(3) 
# A draws_df: 3 iterations, 1 chains, and 4 variables
  b_Intercept sigma lprior  lp__
1         169    25    -19 -1683
2         169    25    -19 -1683
3         168    25    -19 -1683
# ... hidden reserved variables {'.chain', '.iteration', '.draw'}
as_draws_df(fit_press)$b_Intercept %>% mean()
[1] 168.6085
as_draws_df(fit_press)$b_Intercept %>% quantile(c(0.025, .975))
    2.5%    97.5% 
166.0246 171.0875 
as_draws_df(fit_press)$sigma %>% mean()
[1] 24.98079
as_draws_df(fit_press)$sigma %>% quantile(c(0.025, .975))
    2.5%    97.5% 
23.17134 26.83562 

Bayesian Analysis using Stan: brms



Sensitivity Analysis

  • Flat, uninformative priors: E.g., \(\mu \sim \mathit{Uniform}(-10^{20},10^{20})\)
  • Regularizing priors: E.g., \(\mu \sim \mathit{Normal}_{+}(0,1000)\)
  • Principled priors: E.g., \(\mu \sim \mathit{Normal}_{+}(250,100)\)
  • Informative priors: E.g., \(\mu \sim \mathit{Normal}_{+}(200,20)\)
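A minimal sketch of a sensitivity analysis with brms: refit the same model under a more informative prior (hypothetical values) and compare the posterior summaries.

fit_press_inf <- brm(rt ~ 1,
  data = df_spacebar,
  family = gaussian(),
  prior = c(
    prior(normal(200, 20), class = Intercept),  # informative prior on mu
    prior(normal(0, 100), class = sigma)        # truncated at 0, i.e. half-normal
  ),
  chains = 4, iter = 2000, warmup = 1000
)

# compare the two fits: posteriors should be similar if priors are not dominating
posterior_summary(fit_press)[1:2, ]
posterior_summary(fit_press_inf)[1:2, ]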

Bayesian Analysis using Stan: brms



This sensitivity analysis showed that the posterior is not overly affected by the choice of prior.

Bayesian Analysis using Stan: brms



Bayesian Workflow:

  1. Define the model (likelihood and priors)
  2. Check that the priors are reasonable
    • Prior predictive distribution
    • Sensitivity analysis
  3. Fit the model(s)
  4. Check that the posterior predictions match the data
    • Posterior predictive distribution
    • \(p(D_{pred} \mid y) = \int_\Theta p(D_{pred} \mid \Theta)p(\Theta \mid y) d\Theta\)
pp_check(fit_press, 
         ndraws = 100, 
         type = "dens_overlay")

References and Resources

This talk was prepared for SMLP 2022. All the materials are from https://vasishth.github.io/.